Bankruptcy Prediction

Description

A company goes bankrupt when its liabilities exceed its assets and it can no longer meet its obligations. In most cases, bankruptcy models are based on financial indicators that describe the current condition of a certain area of financial health, such as profitability, indebtedness and so on. This research analyses the financial statements and market data of these companies and then applies several models to predict bankruptcy. The goal is to find out how far back these models are able to predict that the companies would get into financial distress, and which information about the financial ratios would enhance the accuracy of the bankruptcy prediction model.

Icon Legends

Inferences from Outcome
Additional Reads
Let's work!
  1. Import Libraries
  2. Set Options
  3. Read Data
  4. Data Analysis and Preparation
  5. Modeling

1. Importing required Libraries

2. Set Options

3. Reading the dataset

Let's import the dataset and start working.

4. Data Analysis and Preparation

Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

We will follow the steps mentioned below to analyse the dataset:-
1. Check dimensions of the dataframe in terms of rows and columns
2. Check the data types. Refer to the data definition to ensure the data types are correct; if not, make the necessary changes
3. Study summary statistics
4. Univariate Analysis: Visualise the target and study distributions of independent variables
5. Detect outliers
6. Check for missing and duplicate values
7. Bivariate Analysis: Study correlation
8. Analyse the relationship between the target variable and the independent variables

4.1 Understanding the Dataset

4.1.1 Data Dimension

There are 6819 rows and 96 columns in the dataset

4.1.2 Checking Data Types

All features are of numeric data types (int or float)

4.1.3 Summary Statistics

Let's perform summary statistics on the dataset and remove insignificant variables.
The output above shows the summary statistics of all the numeric variables: the mean, median (50%), minimum and maximum values, along with the standard deviation.
The counts of all the variables are equal, so there are probably no missing values in these variables. Also, the standard deviation of Net Income Flag is 0, which means this feature is insignificant for further analysis.
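A minimal sketch of this step on a small synthetic frame (the real notebook runs `describe()` on the bankruptcy dataset); the constant `Net Income Flag` column reproduces the zero-standard-deviation case noted above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "ROA": rng.random(50),
    "Debt ratio %": rng.random(50),
    "Net Income Flag": np.ones(50),   # constant column -> std of 0
})
stats = df.describe()                  # count, mean, std, min, quartiles, max
zero_var = stats.loc["std"][stats.loc["std"] == 0].index.tolist()
df = df.drop(columns=zero_var)         # drop zero-variance (insignificant) features
```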

4.1.4 Analysing Target Variable

Let's analyse the target variable "Bankrupt?".
There are far more healthy firms than bankrupt firms in the dataset.
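A sketch of the target check, using a synthetic "Bankrupt?" series with a made-up 97/3 split (the actual class counts come from the dataset itself):

```python
import pandas as pd

# Synthetic target with an imbalance similar in spirit to "Bankrupt?"
y = pd.Series([0] * 970 + [1] * 30, name="Bankrupt?")
counts = y.value_counts()               # 0 = healthy, 1 = bankrupt
bankrupt_share = counts[1] / len(y)     # fraction of bankrupt firms
```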

4.1.5 Distribution Analysis using measures of kurtosis, skewness and distribution plots

Let's analyse the distribution of each variable in the dataset using kurtosis and skewness, and then plot distribution plots for every variable.
As seen in our kurtosis and skewness analysis, most features are skewed and leptokurtic, so the median will be a better measure of central tendency.
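The kurtosis/skewness step can be sketched like this on synthetic data (column names are illustrative; pandas' `kurtosis()` uses the Fisher definition, so leptokurtic means values above 0):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.lognormal(size=2000),   # long right tail
})
skew = df.skew()
kurt = df.kurtosis()        # Fisher definition: ~0 for normal, >0 leptokurtic
skewed_cols = skew[skew.abs() > 1].index.tolist()
```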

4.1.6 Outlier Analysis using Boxplots

Let's analyse outliers for each variable in the dataset using boxplots.
There are a lot of outliers in most of the features; removing these outliers might improve the performance of our model(s).
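A sketch of the boxplot-whisker logic behind the outlier counts, on a synthetic series with a few injected extremes (the plotting call itself is omitted):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Mostly well-behaved values plus a few injected extremes
s = pd.Series(np.concatenate([rng.normal(0, 1, 500), [15.0, -12.0, 20.0]]))

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the boxplot whisker bounds
outliers = s[(s < lower) | (s > upper)]
```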

4.1.7 Null and Duplicate values

There are no null or duplicate values in the dataset

4.1.8 Correlation Analysis

Correlation is a statistic that measures the degree to which two variables move in relation to each other.

In order to compute the correlation matrix, we perform the following:
1. Call the corr() function which will return the correlation matrix of numeric variables
2. Pass the correlation matrix to the heatmap() function of the seaborn library to plot the heatmap of the correlation matrix
3. Extract list of highly correlated pairs (corr > 0.7)
There is high to extreme correlation between some of the independent variables
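The three steps above can be sketched as follows on a synthetic frame; the near-duplicate `ROA(B)` column stands in for the highly correlated ratio pairs found in the real data (the seaborn heatmap call is left as a comment):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
x = rng.normal(size=300)
df = pd.DataFrame({
    "ROA(A)": x,
    "ROA(B)": x + rng.normal(scale=0.1, size=300),  # near-duplicate of ROA(A)
    "Debt ratio %": rng.normal(size=300),
})
corr = df.corr()                       # step 1: correlation matrix
# sns.heatmap(corr, annot=True)        # step 2: heatmap (plot omitted here)
high_pairs = [                         # step 3: pairs with |corr| > 0.7
    (corr.columns[i], corr.columns[j], round(corr.iloc[i, j], 3))
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if abs(corr.iloc[i, j]) > 0.7
]
```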

4.1.9 Bivariate and Multivariate Analysis

In order to perform Bivariate and Multivariate Analysis, we perform the following:
1. Plot barplots to compare medians of bankrupt and healthy firms for all features
2. Identify pairs with extremely high correlation
3. Plot a scatterplot of those pairs with hue as "Bankrupt?"
We observe multiple relationships from our bivariate and multivariate analysis; we analyse these relationships in detail in the next section.

Data Analysis and EDA Inferences

Univariate Analysis

  1. We found that Net Income Flag and Liability-Assets Flag would be insignificant for further analysis as they are not rich in information.

  2. There are far more healthy firms than bankrupt firms in the target feature, pointing to a possible need for up-sampling.

  3. Most of the features are highly skewed and leptokurtic; the median is a better measure of central tendency, and transformation techniques need to be applied.

  4. Most features are rich in outliers, and some also have incorrect values/mistakes at the far end. Removing or treating these outliers might help increase the accuracy of our models. There are quite a few approaches we could take: one method that's gaining traction in the industry is the KNNImputer; we could also remove all the outliers, or impute them with the median values of the features. We could additionally take out values that look like mistakes (i.e. greater than 1 for some of the financial ratios) and try different transformation techniques on the resulting data. Another option is class-wise median imputation: replacing outlier values of a feature among bankrupt firms with the median of that feature over bankrupt firms, and doing the same for healthy firms.

  5. There are no null values in any of the columns.

Scaling and Transformation

As most of the data ranges between 0 and 1, we can use min-max scaling for columns that are not contained within this range if needed during modelling. Keep in mind that this will not treat outliers or skewness; it only brings everything to the same scale.

We can apply a log(x+1) transformation to the columns with skewness greater than 1 or less than -1. We suggest log(x+1) because most of the values are very small to begin with, and applying a direct log might not yield desirable results.
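A sketch of the suggested transformation on synthetic right-skewed data (`np.log1p` computes log(x+1) and is well-behaved near zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))   # right-skewed
skew_before = s.skew()
transformed = np.log1p(s)        # log(x + 1); safe when x is close to zero
skew_after = transformed.skew()
```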

Bivariate and Multivariate Analysis

- Correlation Analysis

We found that a lot of the independent variables are correlated with each other, pointing to moderate to high multicollinearity; we identified 77 distinct pairs of independent features with correlation greater than 0.7 or less than -0.7.

- Ind-Dep Feature Bar Charts

We plotted bar charts with Bankrupt? on the x-axis and the median of each independent feature on the y-axis to see whether there were any clear relationships we could identify.

- Scatterplots of Extremely Correlated Features

We plotted scatterplots for extremely correlated features (corr > 0.90) with hue as bankruptcy to see whether there are any evident clusters.

4.2 Data Preparation & Feature Engineering

Data preparation is a process of cleansing data to get it ready for modelling.

We will follow the steps mentioned below to prepare the dataset:-
1. Remove redundant columns based on Domain knowledge
2. Remove outliers using IQR method
3. Impute null values created due to outliers removal
4. Reduce multicollinearity using the Variance Inflation Factor
5. Transform skewed variables
6. Train-Test split

4.2.1 Removal of redundant features based on Domain knowledge

4.2.2 Outlier Removal using IQR

Let's remove outliers using the IQR method.
As we can see, all the outliers have been removed. New outliers appear, which is expected: once the extremes are gone, the IQR bounds shrink and flag new points.
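A sketch of the IQR step, written so that outliers become NaN for the imputation stage that follows (the 1.5 × IQR bounds are the standard boxplot fences; the data here is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
df = pd.DataFrame({"f1": np.concatenate([rng.normal(0, 1, 200), [30.0, -25.0]])})

def mask_outliers_iqr(frame: pd.DataFrame) -> pd.DataFrame:
    q1, q3 = frame.quantile(0.25), frame.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Replace outliers with NaN (to impute later) instead of dropping whole rows
    return frame.where((frame >= lower) & (frame <= upper))

clean = mask_outliers_iqr(df)
n_masked = int(clean["f1"].isna().sum())
```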

4.2.3 Null value treatment

Let's treat the null values resulting from the removal of outliers.

We will follow the steps mentioned below to treat null values:-
1. Find features with high null values and see if they can be dropped
2. Transform the skewed columns using square root transformation
3. Scale the data to get it ready for KNN imputation.
4. Impute null values using KNNImputer
5. Plot the medians to see if there's any deviation
Null value treatment summary and inferences:
1. There were 4 features with over 20% null values, but not high enough to warrant dropping them
2. With n_neighbors for KNN imputation set to 5, we imputed the NaN values
3. Plotting the medians showed that the treated dataset roughly overlaps the untreated dataset, indicating that the original distribution has been preserved
Read more about the KNNImputer and why it is gaining traction in the Data Science community:

KNNImputer: A robust way to impute missing values (using Scikit-Learn)
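A compact sketch of the scale-then-impute pipeline described above, on synthetic data with artificially injected NaNs (`n_neighbors=5` as in the summary; KNN distances are scale-sensitive, which is why scaling comes first):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.random((100, 3)), columns=["f1", "f2", "f3"])
df.iloc[::10, 0] = np.nan                 # simulate NaNs left by outlier masking

# Scale first; NaNs pass through MinMaxScaler untouched
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=5).fit_transform(scaled), columns=df.columns
)
# Step 5: compare medians before/after to check the distribution is preserved
median_shift = (imputed.median() - scaled.median()).abs()
```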

4.2.4 Multicollinearity reduction

Let's find the features exhibiting high multicollinearity using the Variance Inflation Factor and remove them.
Columns with VIF > 5 have been removed. There are 48 columns remaining.
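statsmodels' `variance_inflation_factor` is the usual tool for this; below is a self-contained sketch that computes VIF directly from its definition, VIF_j = 1 / (1 − R²_j), on synthetic data with one near-collinear pair. In practice you would drop the single highest-VIF column and recompute iteratively rather than dropping all flagged columns at once:

```python
import numpy as np
import pandas as pd

def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing column j on all other columns."""
    X = frame.values
    out = {}
    for j, col in enumerate(frame.columns):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

rng = np.random.default_rng(6)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),   # near-collinear with x1
    "x3": rng.normal(size=200),
})
vifs = vif(df)
flagged = vifs[vifs > 5].index.tolist()   # candidates for removal
```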

4.2.5 Train-Test split

Let's perform the Train-Test split using the following steps:-
1. Specify Target and predictor variables.
2. Call train_test_split from sklearn and perform the split.
3. Print the shapes of all the splits.
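The split steps above can be sketched as follows on synthetic data; `stratify=y` is an added assumption worth making explicit, since it preserves the rare bankrupt class in both splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# Step 1: specify predictors X and target y (synthetic stand-ins here)
X = pd.DataFrame(rng.random((200, 4)), columns=[f"f{i}" for i in range(4)])
y = pd.Series([0] * 180 + [1] * 20, name="Bankrupt?")

# Step 2: stratified split keeps the bankrupt/healthy ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Step 3: inspect shapes
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```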

5. Modeling

Our data analysis and preparation is complete; now we move on to the next stage, where we construct predictive models using various algorithms and select the best one for deployment. Our aim is to maximise recall, as letting unhealthy companies slip by is more costly than classifying some healthy companies as unhealthy.

We will explore and optimise the following models:-
1. Logistic Regression (Maximum likelihood estimation)
2. Decision Tree
3. Random Forest
4. K-Nearest Neighbors
5. Naïve Bayes

5.1 Logistic Regression(MLE)

We will follow these steps for Logistic Regression:-
1. Build a full Logistic Regression model using maximum likelihood estimation
2. Plot the Receiver Operating Characteristic (ROC) curve and get the Area Under the Curve (AUC)
3. Build scorecard to see how different metrics like Precision, Recall, F1-score and kappa are performing under different thresholds.
4. Perform recursive feature elimination(RFE)
5. Build a model with features that survived RFE.
6. Repeat steps 2 and 3 for RFE model
7. Compare both models

5.1.1 Full model

1. Our full Logistic Regression (MLE) model gave us a ROC-AUC of 0.9555.
2. Maximum precision was 0.833333 at threshold 0.9
3. Maximum recall was 0.948718 at threshold 0.020431, obtained through Youden's index.
4. Maximum F1 was 0.483516 at threshold 0.2
5. Maximum kappa was 0.459989 at threshold 0.2
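A sketch of the full-model workflow on synthetic data (the metrics reported above come from the real dataset; `make_classification` here just provides an imbalanced stand-in). Youden's index picks the threshold maximising TPR − FPR:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in: ~95% "healthy", ~5% "bankrupt"
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)

# Youden's J = TPR - FPR; its argmax gives the recall-friendly threshold
fpr, tpr, thresholds = roc_curve(y_te, proba)
best_threshold = thresholds[np.argmax(tpr - fpr)]
```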

5.1.2 Recursive Feature Elimination : Logistic Regression

1. Our Logistic Regression model after RFE gave us a ROC-AUC of 0.9593, which is barely above our full model.
2. Maximum precision was 0.833333 at threshold 0.9.
3. Maximum recall was 0.974359 at threshold 0.020815, obtained through Youden's index.
4. Maximum F1 was 0.5 at threshold 0.3
5. Maximum kappa was 0.480183 at threshold 0.3
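The RFE steps can be sketched like this, again on a synthetic stand-in; `n_features_to_select=8` is an arbitrary illustrative choice, not the count used on the real data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=800, n_features=20, n_informative=5, weights=[0.95], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

# RFE repeatedly drops the weakest feature until 8 survive
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit(X_tr, y_tr)
X_tr_rfe, X_te_rfe = X_tr[:, rfe.support_], X_te[:, rfe.support_]

# Refit on the surviving features only
rfe_model = LogisticRegression(max_iter=1000).fit(X_tr_rfe, y_tr)
auc_rfe = roc_auc_score(y_te, rfe_model.predict_proba(X_te_rfe)[:, 1])
```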

5.1.3 Model comparison : RFE vs Full

Our aim is to maximise recall, as letting unhealthy companies slip by is more costly than classifying some healthy companies as unhealthy. Following that logic, recall is maximised by the RFE model at threshold 0.020815, so our best logit model is the RFE model.

5.2 Decision Tree
